By the end of this lab, you will: 1. Load and analyze the Lightcast dataset in Spark DataFrame. 2. Create five easy and three medium-complexity visualizations using Plotly. 3. Explore salary distributions, employment trends, and job postings. 4. Analyze skills in relation to NAICS/SOC/ONET codes and salaries. 5. Customize colors, fonts, and styles in all visualizations (default themes result in a 2.5-point deduction). 6. Follow best practices for reporting on data communication.
Step 1: Load the Dataset
import pandas as pdimport plotly.express as pximport plotly.io as piopio.renderers.default ="vscode"from pyspark.sql import SparkSessionfrom pyspark.sql.functions import col# Initialize Spark Sessionspark = SparkSession.builder.appName("LightcastData").getOrCreate()# Load Datadf = spark.read.option("header", "true").option("inferSchema", "true").option("multiLine","true").option("escape", "\"").csv("/home/ubuntu/github-classroom/met-ad-688/lab06-zimozeng12/job_postings.csv")# Show Schema and Sample Datadf.printSchema()df.show(5)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/21 03:05:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/03/21 03:05:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/03/21 03:05:34 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Identify salary trends across different employment types.
Filter the dataset
Remove records where salary is missing or zero.
Aggregate Data
Group by employment type and compute salary distribution.
Visualize results
Create a box plot where:
X-axis = EMPLOYMENT_TYPE_NAME
Y-axis = SALARY_FROM
Customize colors, fonts, and styles to avoid a 2.5-point deduction.
Explanation: Write two sentences about what the graph reveals.
pio.renderers.default ="notebook"# Filter data where SALARY_FROM is not null and greater than 0df_salary_filtered = df.select("EMPLOYMENT_TYPE_NAME", "SALARY_FROM") \ .filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") >0))# Convert to Pandas DataFramepdf_salary = df_salary_filtered.toPandas()# Create a customized box plotfig = px.box( pdf_salary, x="EMPLOYMENT_TYPE_NAME", y="SALARY_FROM", title="Salary Distribution by Employment Type", color_discrete_sequence=["#2CA02C"] # Custom color)fig.update_layout( title_font=dict(size=22, family="Arial Black"), xaxis_title="Employment Type", yaxis_title="Salary From (USD)", plot_bgcolor="rgba(245, 245, 245, 1)", paper_bgcolor="rgba(255, 255, 255, 1)", font=dict(family="Verdana", size=14),)fig.show()fig.write_image("output/Salary Distribution by Employment Type.svg")
The box plot indicates that full-time positions (> 32 hours) offer higher starting salaries on average compared to part-time roles. Additionally, the full-time category exhibits a wider salary range and more high-end outliers, suggesting greater earning potential and variability in compensation.
2 Salary Distribution by Industry
Compare salary variations across industries.
Filter the dataset
Keep records where salary is greater than zero.
Aggregate Data
Group by NAICS industry codes.
Visualize results
Create a box plot where:
X-axis = NAICS2_NAME
Y-axis = SALARY_FROM
Customize colors, fonts, and styles.
Explanation: Write two sentences about what the graph reveals.
# Filter SALARY_FROM > 0 and NAICS2_NAME not nulldf_industry_salary = df.select("NAICS2_NAME", "SALARY_FROM") \ .filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") >0) & (col("NAICS2_NAME").isNotNull()))# Convert to Pandas for visualizationpdf_industry = df_industry_salary.toPandas()# Remove 'Unclassified Industry' entries (case insensitive just in case)pdf_industry = pdf_industry[~pdf_industry["NAICS2_NAME"].str.lower().str.contains("unclassified")]# Create box plot with custom styleimport plotly.express as pxfig = px.box( pdf_industry, x="NAICS2_NAME", y="SALARY_FROM", title="Salary Distribution by Industry (NAICS2)", color_discrete_sequence=["#1F77B4"] # Custom color)# Custom styling to avoid deductionfig.update_layout( title_font=dict(size=22, family="Arial Black"), xaxis_title="Industry (NAICS2)", yaxis_title="Salary From (USD)", plot_bgcolor="rgba(240, 240, 240, 1)", paper_bgcolor="rgba(255, 255, 255, 1)", font=dict(family="Verdana", size=14), xaxis_tickangle=45, height=600)fig.show()fig.write_image("output/Salary Distribution by Industry.svg")
The box plot shows that salary levels vary significantly across industries, with some sectors displaying wider ranges and higher median values. Industries such as Information and Professional Services offer relatively higher salaries, while sectors like Retail and Administrative Services tend to have lower and more compressed salary distributions.
3 Job Posting Trends Over Time
Analyze how job postings fluctuate over time.
Aggregate Data
Count job postings per posted date (POSTED).
Visualize results
Create a line chart where:
X-axis = POSTED
Y-axis = Number of Job Postings
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Select POSTED date and filter out nullsdf_posted = df.select("POSTED").filter(col("POSTED").isNotNull())# Convert to Pandaspdf_posted = df_posted.toPandas()# Count job postings per datepostings_by_date = pdf_posted.groupby("POSTED").size().reset_index(name="Job Postings")# Create line chart with custom stylingfig = px.line( postings_by_date, x="POSTED", y="Job Postings", title="Job Posting Trends Over Time", markers=True,)fig.update_traces(line=dict(color="#D62728", width=2)) # Custom colorfig.update_layout( title_font=dict(size=22, family="Arial Black"), xaxis_title="Date Posted", yaxis_title="Number of Job Postings", plot_bgcolor="rgba(245, 245, 245, 1)", paper_bgcolor="rgba(255, 255, 255, 1)", font=dict(family="Verdana", size=14), height=500)fig.show()fig.write_image("output/Job Posting Trends Over Time.svg")
The line chart reveals frequent fluctuations in daily job postings, with noticeable spikes occurring periodically throughout the observed months. This indicates dynamic hiring patterns, possibly influenced by short-term business needs or seasonal demand.
4 Top 10 Job Titles by Count
Identify the most frequently posted job titles.
Aggregate Data
Count the occurrences of each job title (TITLE_NAME).
Select the top 10 most frequent titles.
Visualize results
Create a bar chart where:
X-axis = TITLE_NAME
Y-axis = Job Count
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Select TITLE_NAME and filter out nullsdf_titles = df.select("TITLE_NAME").filter(col("TITLE_NAME").isNotNull())# Convert to Pandaspdf_titles = df_titles.toPandas()# Remove 'Unclassified' job titles (case insensitive just in case)pdf_titles = pdf_titles[~pdf_titles["TITLE_NAME"].str.lower().str.contains("unclassified")]# Count job title frequenciestitle_counts = pdf_titles["TITLE_NAME"].value_counts().nlargest(10).reset_index()title_counts.columns = ["Job Title", "Job Count"]# Create custom-styled bar chartfig = px.bar( title_counts, x="Job Title", y="Job Count", title="Top 10 Job Titles by Count (Excluding Unclassified)", text="Job Count", color_discrete_sequence=["#9467BD"] # Custom color)fig.update_layout( title_font=dict(size=18, family="Arial Black"), xaxis_title="Job Title", yaxis_title="Number of Postings", plot_bgcolor="rgba(240,240,240,1)", paper_bgcolor="rgba(255,255,255,1)", font=dict(family="Verdana", size=14), xaxis_tickangle=45, height=700)fig.update_traces(textposition='outside')fig.show()fig.write_image("output/Top 10 Job Titles by Count.svg")
The bar chart reveals that Data Analysts are by far the most frequently posted job title, significantly outpacing all other roles. Other top titles such as Business Intelligence Analysts and Enterprise Architects also show notable demand, highlighting the importance of data-driven and strategic roles in the job market.
5 Remote vs On-Site Job Postings
Compare the proportion of remote and on-site job postings.
Aggregate Data
Count job postings by remote type (REMOTE_TYPE_NAME).
Visualize results
Create a pie chart where:
Labels = REMOTE_TYPE_NAME
Values = Job Count
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Select REMOTE_TYPE_NAME and filter nulls + '[None]'df_remote = df.select("REMOTE_TYPE_NAME") \ .filter(col("REMOTE_TYPE_NAME").isNotNull()) \ .filter(col("REMOTE_TYPE_NAME") !="[None]")# Convert to Pandaspdf_remote = df_remote.toPandas()# Count by remote typeremote_counts = pdf_remote["REMOTE_TYPE_NAME"].value_counts().reset_index()remote_counts.columns = ["Remote Type", "Job Count"]# Custom Pie Chartfig = px.pie( remote_counts, names="Remote Type", values="Job Count", title="Remote vs On-Site Job Postings", color_discrete_sequence=px.colors.sequential.Tealgrn)fig.update_layout( title_font=dict(size=22, family="Arial Black"), font=dict(family="Verdana", size=14), paper_bgcolor="rgba(255,255,255,1)",)fig.update_traces(textinfo="percent+label", textfont_size=14)fig.show()fig.write_image("output/Remote vs On-Site Job Postings.svg")
The pie chart indicates that fully remote positions account for the majority of job postings, with hybrid remote roles also representing a significant share. In contrast, on-site jobs make up a smaller portion, reflecting the growing shift toward flexible work arrangements.
6 Skill Demand Analysis by Industry (Stacked Bar Chart)
Identify which skills are most in demand in various industries.
Aggregate Data
Extract skills from job postings.
Count occurrences of skills grouped by NAICS industry codes.
Visualize results
Create a stacked bar chart where:
X-axis = Industry
Y-axis = Skill Count
Color = Skill
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Select industry and skill fields, filter out nullsdf_skills = df.select("NAICS2_NAME", "COMMON_SKILLS_NAME") \ .filter(col("NAICS2_NAME").isNotNull() & col("COMMON_SKILLS_NAME").isNotNull())# Convert to Pandaspdf_skills = df_skills.toPandas()# Convert skill strings to Python listsimport astpdf_skills["COMMON_SKILLS_NAME"] = pdf_skills["COMMON_SKILLS_NAME"].apply(ast.literal_eval)# Flatten to (Industry, Skill) rowsexploded = pdf_skills.explode("COMMON_SKILLS_NAME")exploded = exploded.rename(columns={"COMMON_SKILLS_NAME": "Skill", "NAICS2_NAME": "Industry"})# Remove "Unclassified Industry"exploded_filtered = exploded[exploded["Industry"] !="Unclassified Industry"]# Count skills per industryskill_counts = exploded_filtered.groupby(["Industry", "Skill"]).size().reset_index(name="Count")# Identify top 5 most frequent skills overalltop_skills = skill_counts.groupby("Skill")["Count"].sum().nlargest(5).index.tolist()# Filter data to include only top 5 skillsskill_counts_filtered = skill_counts[skill_counts["Skill"].isin(top_skills)]# Get total skill count per industryindustry_totals = skill_counts_filtered.groupby("Industry")["Count"].sum().nlargest(10).index.tolist()# Filter to top 10 industriesskill_counts_top_industries = skill_counts_filtered[skill_counts_filtered["Industry"].isin(industry_totals)]# Create stacked bar chartfig = px.bar( skill_counts_top_industries, x="Industry", y="Count", color="Skill", title="Top 5 In-Demand Skills by Industry (Top 10 Industries)", text="Count", color_discrete_sequence=px.colors.qualitative.Pastel)fig.update_layout( title_font=dict(size=22, family="Arial Black"), xaxis_title="Industry (NAICS2)", yaxis_title="Skill Count", plot_bgcolor="rgba(240,240,240,1)", paper_bgcolor="rgba(255,255,255,1)", font=dict(family="Verdana", size=14), xaxis_tickangle=45, barmode='stack', height=850)fig.update_traces(textfont_size=12, textposition='inside')fig.show()fig.write_image("output/Top 5 In-Demand Skills by Industry (Top 10 Industries).svg")
The stacked bar chart displays the top five most common skills across the ten industries with the highest demand, offering clear insight into industry-specific skill requirements. “Communication” and “Management” skills are particularly prominent in Professional and Administrative Services, while industries such as Finance and Information Technology show strong demand for problem-solving and operations skills.
7 Salary Analysis by ONET Occupation Type (Bubble Chart)
Analyze how salaries differ across ONET occupation types.
Aggregate Data
Compute median salary for each occupation in the ONET taxonomy.
Visualize results
Create a bubble chart where:
X-axis = ONET_NAME
Y-axis = Median Salary
Size = Number of job postings
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Select ONET occupation and salary, filter out null and zero salariesdf_onet_salary = df.select("ONET_NAME", "SALARY_FROM") \ .filter(col("ONET_NAME").isNotNull() & col("SALARY_FROM").isNotNull() & (col("SALARY_FROM") >0))# Convert to Pandas DataFramepdf_onet_salary = df_onet_salary.toPandas()# Group by ONET occupation: calculate median salary and job countonet_salary_stats = pdf_onet_salary.groupby("ONET_NAME").agg( Median_Salary=("SALARY_FROM", "median"), Job_Count=("SALARY_FROM", "count")).reset_index()# Filter out occupations with too few postingsonet_salary_stats = onet_salary_stats[onet_salary_stats["Job_Count"] >=10]# Determine how many occupations to displayiflen(onet_salary_stats) >10: onet_to_plot = onet_salary_stats.sort_values("Job_Count", ascending=False).head(10)else: onet_to_plot = onet_salary_stats # Show all if less than 10# Create bubble chartfig = px.scatter( onet_to_plot, x="ONET_NAME", y="Median_Salary", size="Job_Count", title="Salary Analysis by ONET Occupation Type", text="Job_Count", color_discrete_sequence=["#17BECF"], size_max=60)fig.update_layout( title_font=dict(size=22, family="Arial Black"), xaxis_title="ONET Occupation Type", yaxis_title="Median Salary (USD)", plot_bgcolor="rgba(240,240,240,1)", paper_bgcolor="rgba(255,255,255,1)", font=dict(family="Verdana", size=14), xaxis_tickangle=45, height=600)fig.update_traces(textposition='top center', textfont_size=12)fig.show()fig.write_image("output/Salary Analysis by ONET Occupation Type.svg")
The bubble chart shows that all valid salary data is concentrated in the occupation “Business Intelligence Analysts”, with a median salary of $88,000 and over 32,000 job postings. This suggests an exceptionally high demand for this role, potentially overshadowing other occupations due to incomplete salary reporting.
8 Career Pathway Trends (Sankey Diagram)
Visualize job transitions between different occupation levels.
Aggregate Data
Identify career transitions between SOC job classifications.
Visualize results
Create a Sankey diagram where:
Source = SOC_2021_2_NAME
Target = SOC_2021_3_NAME
Value = Number of transitions
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Select finer SOC classifications for career flow analysisdf_soc_alt = df.select("SOC_2021_3_NAME", "SOC_2021_4_NAME") \ .filter(col("SOC_2021_3_NAME").isNotNull() & col("SOC_2021_4_NAME").isNotNull())# Convert to Pandaspdf_soc_alt = df_soc_alt.toPandas()# Group by SOC level 3 ➡️ level 4 to count transitionssoc_counts_alt = pdf_soc_alt.groupby(["SOC_2021_3_NAME", "SOC_2021_4_NAME"]).size().reset_index(name="Count")# Keep only top 10 most common transitions for readabilitysoc_counts_alt = soc_counts_alt.sort_values("Count", ascending=False).head(10)# Create unique label list and mapping to indiceslabels =list(set(soc_counts_alt["SOC_2021_3_NAME"]).union(set(soc_counts_alt["SOC_2021_4_NAME"])))label_map = {name: idx for idx, name inenumerate(labels)}# Map names to indices for source and targetsoc_counts_alt["source_idx"] = soc_counts_alt["SOC_2021_3_NAME"].map(label_map)soc_counts_alt["target_idx"] = soc_counts_alt["SOC_2021_4_NAME"].map(label_map)# Create Sankey diagramimport plotly.graph_objects as gofig = go.Figure(data=[go.Sankey( node=dict( pad=20, thickness=20, line=dict(color="black", width=0.5), label=labels, color="lightblue" ), link=dict( source=soc_counts_alt["source_idx"], target=soc_counts_alt["target_idx"], value=soc_counts_alt["Count"], color="rgba(31,119,180,0.4)"# Custom transparent blue ))])fig.update_layout( title_text="Career Pathway Trends by SOC Classification (Level 3 to 4)", font=dict(size=14, family="Verdana"), title_font=dict(size=22, family="Arial Black"), paper_bgcolor="white", plot_bgcolor="white", height=700)fig.show()fig.write_image("output/Career Pathway Trends by SOC Classification (Level 3 to 4).svg")
The Sankey diagram illustrates a highly concentrated career pathway from “Mathematical Science Occupations” to “Data Scientists”. This dominant transition suggests that a significant portion of job postings within mathematical fields are targeted specifically at data science roles, indicating a clear and specialized career trajectory within this occupational cluster.